Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve non-missingness during non-inner joins #1316

Merged
merged 2 commits into from
Dec 18, 2017

Conversation

alyst
Copy link
Contributor

@alyst alyst commented Dec 13, 2017

Currently, if you do a few joins, most of the columns in your data frame become "missable" (>: Missing).
That's because so far join preserves non-missingness only for inner joins:

  • inner join doesn't introduce new missing values => non-missingness of columns from both left and right frames should be preserved.

This PR adds the other logical rules:

  • left join doesn't introduce new missing values in the left columns => non-missingness of the left frame should be preserved
  • right join doesn't introduce new missing values in the right columns => non-missingness of the right frame should be preserved
  • joins do not introduce missing values to the on-columns => non-missingness of the on columns should be preserved if both left-on and right-on columns are non-missing

The rules above are "type stable", so to say. I.e. they don't depend on the contents of the frames.
But I thought it also makes sense to introduce data-dependent rules (there's no performance penalty in checking them):

  • if all right rows have matching left rows (i.e. no missing values are introduced to the left columns) => non-missingness of the left columns is preserved
  • if all left rows have matching right rows (i.e. no missing values are introduced to the right columns) => non-missingness of the right columns is preserved

As a reasonable side effect of the rules implementation, when doing the right join, the eltype (and levels, for categorical arrays) of the right on-column has priority over the left on-column.

@alyst
Copy link
Contributor Author

alyst commented Dec 13, 2017

Travis Mac CI stalled during Julia image download. other tests ok

@cjprybol
Copy link
Contributor

Last time we considered this we voted against it because we were still using NullableArrays and we wanted to consistently interact with the columns using get and f.() dot-broadcasting, right? But now we're using the Missings approach and will interact with columns the same way whether they have missing data or not. And there will be downstream performance gains (at least in the near future) if we can avoid introducing the Union{T, Missing} and just keep T, so the extra work in join probably makes sense in many situations. Thanks for putting together this PR!

@alyst alyst mentioned this pull request Dec 13, 2017
Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I agree about the "type-stable" part, but I'm not a fan of changing the returned type depending on the data. Apart from making the behavior hard to predict, it won't work with a fully type-stable variant of DataFrame that I've considered supporting as an alternative. Do you really think it's common to make a left/right join without any missing values? The point of these operations is that you expect some rows to be missing from the second table, or you can use an inner join instead.

_similar = kind == :inner ? similar : similar_missing
# inner and left joins preserve non-missingness of the left frame
# it is also preserved if all right rows have left matches
_similar_left = kind == :inner || kind == :left || length(rightonly_ixs.join) == 0 ? similar : similar_missing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the condition is very long, better use if ...; _similar_left = ...

on_col_ix = findfirst(joiner.left_on, names(joiner.dfl)[i])
if on_col_ix > 0 && kind == :right
# if right join, construct the on-column
# using the right frame to preserve missingness and cat.levels
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's "cat.levels"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

categorical levels :) I'll expand it

rcol = joiner.dfr_on[on_col_ix]
cols[i] = similar(rcol, nrow)
copy!(cols[i], view(rcol, all_orig_right_ixs))
permute!(cols[i], right_perm)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there an alternative solution which doesn't involve calling permute!? I'm concerned about the performance impact compared with the previous code which simply called copy!.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the case only for the right joins. Previously it was handled later in the code (if length(rightonly_ixs) > 0 block).
Actually, now it should be faster, because we compose the resulting column using the right frame directly, instead of initializing it using the left frame and fixing it later.

@alyst
Copy link
Contributor Author

alyst commented Dec 13, 2017

I'm not a fan of changing the returned type depending on the data. Apart from making the behavior hard to predict, it won't work with a fully type-stable variant of DataFrame that I've considered supporting as an alternative. Do you really think it's common to make a left/right join without any missing values? The point of these operations is that you expect some rows to be missing from the second table, or you can use an inner join instead.

The problem of the inner join is that it "swallows" the rows that do not have matches, and checking for that is more expensive (performance- and code size-wise) than the introduction of missing values and checking the resulting column type.
I'm also a little bit concerned about "type-instability". I was thinking whether we can introduce some switch to join() that specifies the missingness behaviour (don't have a good name for it ATM):

  • type-stable: no data-dependent rules
  • with data-dependent rules
  • strict: the original non-missingness is preserved, so if we are doing e.g. left join and some left rows have no matching right rows, but some right columns don't allow missing values, join() throws MissingException (so that's inner join with no "swallowed" rows allowed)

@nalimilan
Copy link
Member

The problem of the inner join is that it "swallows" the rows that do not have matches, and checking for that is more expensive (performance- and code size-wise) than the introduction of missing values and checking the resulting column type.

It's not an issue if we use an Array{Union{T, Missing}} internally, as long as we take care of calling disallowmissing before returning it.

I'm also a little bit concerned about "type-instability". I was thinking whether we can introduce some switch to join() that specifies the missingness behaviour (don't have a good name for it ATM):

type-stable: no data-dependent rules
with data-dependent rules
strict: the original non-missingness is preserved, so if we are doing e.g. left join and some left rows have no matching right rows, but some right columns don't allow missing values, join() throws MissingException (so that's inner join with no "swallowed" rows allowed)

Given that we have disallowmissing!, I don't really see the advantage of providing a keyword argument to achieve the same result. And the "strict" case sounds unusual to me, as left/right joins are supposed to generate missing values anyway. I don't think any other software supports this kind of option, which could be confusing.

@alyst
Copy link
Contributor Author

alyst commented Dec 13, 2017

It's not an issue if we use an Array{Union{T, Missing}} internally, as long as we take care of calling disallowmissing before returning it.

I'm not sure I follow. By "internally" you mean within join()?
Or do you mean calling disallowmissing() after join() (from the user code)?
That's what I'm doing right now (after the left join) in my scripts to verify that each left row has right row match.
That's fine for the check (it requires just one right column), but if I want all non-missing right columns to stay non-missing, that's a few lines of dataschema-specific code.

Given that we have disallowmissing!, I don't really see the advantage of providing a keyword argument to achieve the same result.

See the usecase above, the keyword is equivalent to applying disallowmissing to all relevant columns (that were not of type Union{T,Missing} in the source frame and do not contain missing in the result).
I was thinking whether it could be implemented as a separate method, but it would require all 3 frames (left, right, result), so using it will look just like ugly join postprocessing.

And the "strict" case sounds unusual to me, as left/right joins are supposed to generate missing values anyway. I don't think any other software supports this kind of option, which could be confusing.

In SQL you have "INSERT INTO tbl SELECT ...", which checks that the result of SELECT fits the constraints of "tbl".
In data frame world we don't have data schemas, so the strict option is just a poor man's replacement (the result fits the constraints imposed by the source frames).
But I agree it would be confusing. Actually, since it's just inner join + the check that each row has a match, an alternative is to provide it as a new join kind (kind=:strict).

EDIT: I was wrong that "strict" is just a specific type of "inner" join. There could also be "strict left" (all left rows should have matching right rows) and "strict right" joins.

@alyst
Copy link
Contributor Author

alyst commented Dec 14, 2017

I've updated the commit removing the data-dependent rules.
It looks like there is consensus regarding the rest of the PR, so doesn't make sense to block it.
Data-dependent rules deserve a separate PR (I'll submit it as soon as this one is merged) and separate discussion.

@alyst
Copy link
Contributor Author

alyst commented Dec 14, 2017

CI passes on 0.6

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Indeed, better merge the uncontroversial parts first an discuss more problematic ones later.

for (on_col_ix, on_col) in enumerate(joiner.left_on)
# fix the result of the rightjoin by taking the nonmissing values from the right table
offset = nrow - length(rightonly_ixs.orig) + 1
copy!(res[on_col], offset, view(joiner.dfr_on[on_col_ix], rightonly_ixs.orig))
end
end
if kind == :outer && !isempty(rightonly_ixs.join)
# some non-missing on-columns may have become missing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"missing column" can be confusing as it's not clear whether it's the column which might be missing. Better say something like "columns allowing for missing" even if it's a bit long.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix. In the long run would be nice to have a term for it (even the ugly internal one, which would be used in cases like this).

for (on_col_ix, on_col) in enumerate(joiner.left_on)
LT = eltype(joiner.dfl_on[on_col_ix])
RT = eltype(joiner.dfr_on[on_col_ix])
if Missings.T(LT) === LT && Missings.T(RT) === RT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this just be !(LT >: Missing || RT >: Missing)? That's more explicit IMHO. It could also be worth adding a special case for Any columns, for which there's no point in calling disallowmissing since it won't have any effect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it this way because Any >: Missing. But you're right, in this context we don't need to fix Any columns. Will fix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But Missings.T(Any) === Any too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly, so the code would have called disallowmissing, which is wrong.
Since !(Any >: Missing || T >: Missing) === false, it would not be called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a testset for joins with Any eltypes.

test/join.jl Outdated
@@ -346,8 +343,8 @@ module TestJoin
@test levels(join(B, A, on=:b, kind=:inner)[:b]) == ["a", "b", "c"]
@test levels(join(A, B, on=:b, kind=:left)[:b]) == ["d", "c", "b", "a"]
@test levels(join(B, A, on=:b, kind=:left)[:b]) == ["a", "b", "c"]
@test levels(join(A, B, on=:b, kind=:right)[:b]) == ["d", "c", "b", "a"]
@test levels(join(B, A, on=:b, kind=:right)[:b]) == ["a", "b", "d", "c"]
@test levels(join(A, B, on=:b, kind=:right)[:b]) == ["a", "b", "c"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this change? The docstring for join gives guarantees about the levels, so either it wasn't correct before or it might need to be updated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this PR the types and categorical levels of the right on-columns have priority over the left ones.
I think it's a logical thing. Also it's easier to implement.
So I can either patch the PR to the previous behaviour or update the docstring.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as it's not too hard to implement, I'd choose the rule which is the simplest to explain in the docstring, which is a good indication that it's easy to understand. IIUC, the new behavior you propose is to give priority to the left column, except for kind=:right? I'm not too fan of exceptions, even if I see the justification.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reverted to the old category levels ordering.

- left and right joins preserve non-missingness of the left and right
frame columns, respectively
- non-missingness of the on-columns is preserved (if non-missing
  in both frames)
@alyst
Copy link
Contributor Author

alyst commented Dec 18, 2017

squashed, in case you want to merge it preserving the commits

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants